Intentions
While the univariate section showed us the absolute and relative levels of free sulfur dioxide, our other questions of interest have to do with finding relationships between free sulfur dioxide and the other variables in the dataset. In this bivariate section we will first look at a correlation table and scatterplot matrix that will hopefully provide a starting point for finding those correlations. The rest of the bivariate section will expand upon the findings of the correlation table and scatterplot matrix.
Note: As almost every plot from this point forward could arguably address each of the four questions of interest at once, I will cease listing them explicitly and move to a more stream-of-consciousness style to explain my thoughts and efforts.
Correlation Table
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## pH sulphates alcohol quality
## X -0.1157741316 0.009807759 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 -0.174737218
## density -0.0935914935 0.074493149 -0.78013762 -0.307123313
## pH 1.0000000000 0.155951497 0.12143210 0.099427246
## sulphates 0.1559514973 1.000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.017432772 1.00000000 0.435574715
## quality 0.0994272457 0.053677877 0.43557472 1.000000000
Note that at some points the labels of the scatterplot matrix can get hard to read, but the features are plotted in the same order as their inclusion in the correlation table.
Scatterplot Matrix

Looking at the free sulfur dioxide results from the correlation table I noted the strongest linear correlations with residual.sugar (.299), total.sulfur.dioxide (.616), density (.294) and alcohol (-.250) [values rounded].
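The row of interest can be pulled out of a correlation matrix and ranked by absolute strength rather than read off by eye. A minimal sketch follows; it uses a small synthetic data frame (invented numbers, not the wine data), and the name `toy` is a stand-in for the real data frame.

```r
# Synthetic stand-in for the wine data frame (values are invented for illustration).
set.seed(1)
n <- 200
total <- rnorm(n, mean = 138, sd = 40)
toy <- data.frame(
  total.sulfur.dioxide = total,
  free.sulfur.dioxide  = 0.25 * total + rnorm(n, sd = 12),
  pH                   = rnorm(n, mean = 3.2, sd = 0.15)
)

# Full Pearson correlation matrix, then the row for the feature of interest.
cors <- cor(toy)
free_row <- cors["free.sulfur.dioxide", ]

# Drop the self-correlation and rank the rest by absolute strength.
free_row <- free_row[names(free_row) != "free.sulfur.dioxide"]
ranked <- free_row[order(abs(free_row), decreasing = TRUE)]
print(round(ranked, 3))
```

Ranking by absolute value keeps strong negative correlations (such as the one with alcohol) from being overlooked.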
I am surprised there is not a linear relationship between sulphates and free sulfur dioxide, as the data set text file said that sulphates can contribute to free sulfur dioxide levels. At this point I am thinking that maybe the relationship is non-linear, or that a pattern might emerge if other (yet unknown) variables are accounted for. But then again, "could" means "sometimes won’t", so we will see.
I also initially expected to see a stronger relationship between free sulfur dioxide and pH (r value: -.001), but I think this was probably short-sighted considering the logarithmic nature of pH.
pH and Free Sulfur Dioxide
Before moving on to those values that did show a linear correlation with free sulfur dioxide, I wanted to create a scatterplot of pH and free sulfur dioxide to see if there was any obvious evidence of a non-linear correlation.

Looking at the scatterplot there does not seem to be any connection, linear or non-linear, between pH and free sulfur dioxide. Given those results it seemed doubtful that transforming pH into a linear count of hydrogen ions would reveal anything interesting, but I did it anyway.
Also: The horizontal red line at 50 mg / dm3 free sulfur dioxide is meant to flag where (according to the data set text file) the taste and smell of free sulfur dioxide becomes apparent.

##
## Pearson's product-moment correlation
##
## data: ww$free.sulfur.dioxide and (10^(ww$pH))
## t = -0.4965, df = 4896, p-value = 0.6196
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03509507 0.02091502
## sample estimates:
## cor
## -0.007095588
There’s no obvious relationship after transformation. The only thing of note I can see is that the more acidic values seem more spaced out than the tightly clustered, more basic values, but this might simply be an effect of having fewer wines with a relatively very low pH.
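For reference, hydrogen ion concentration in mol/L is 10^(-pH), so the `10^(ww$pH)` term in the test output above is its reciprocal rather than the concentration itself. A hedged sketch of running the same test on the concentration, with synthetic data (invented values) standing in for `ww`:

```r
# Synthetic stand-in data (invented values; not the wine measurements).
set.seed(42)
toy <- data.frame(
  pH = runif(300, min = 2.7, max = 3.8),
  free.sulfur.dioxide = rnorm(300, mean = 35, sd = 15)
)

# Hydrogen ion concentration in mol/L: [H+] = 10^(-pH).
toy$h_conc <- 10^(-toy$pH)

# Pearson test on the linear-scale concentration.
res <- cor.test(toy$free.sulfur.dioxide, toy$h_conc)
res$estimate   # sample correlation
res$conf.int   # 95 percent confidence interval
res$p.value
```

Because taking a reciprocal is not a linear transformation, the Pearson coefficient for 10^(-pH) will generally differ somewhat from the one for 10^(pH), although with a scatter this diffuse the conclusion is unlikely to change.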
Free Sulfur Dioxide and Total Sulfur Dioxide
It makes theoretical sense that the more total sulfur dioxide a wine contains, the higher the absolute levels (and maybe even the relative levels) of free sulfur dioxide will be. However, we do not yet have evidence that this is true.
##
## Pearson's product-moment correlation
##
## data: ww$total.sulfur.dioxide and ww$free.sulfur.dioxide
## t = 54.6447, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501

The correlation table shows a strong linear relationship between free sulfur dioxide and total sulfur dioxide. The first scatterplot shows the positive correlation as well, but one outlier in particular is causing the axes of the graph to expand to a size that really condenses the main mass of data. Owing to the size limitations of the format, I decided to zoom in on the main mass of data in the next scatterplot.

This graph gives a very nice look at the data. In the scatterplot we can see the linear correlation as well as the apparent tendency for more variability in the free sulfur dioxide levels as total sulfur dioxide rises.
Proportion of Free Sulfur Dioxide and Total Sulfur Dioxide
The next natural question would seem to be does the variability in the proportion of free sulfur dioxide really rise as total sulfur dioxide levels do?
##
## Pearson's product-moment correlation
##
## data: ww$SO2.portion.free and ww$total.sulfur.dioxide
## t = -0.9411, df = 4896, p-value = 0.3467
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04143870 0.01456409
## sample estimates:
## cor
## -0.01344785

The results of the correlation table show that there is very probably not a linear relationship (p-value of 0.35) between the proportion of free sulfur dioxide and total sulfur dioxide.
However, the scatterplot does have some interesting characteristics. Unfortunately, there is overplotting at the size the graph is rendered in the knitted HTML file, but when enlarged there are distinct curvilinear trends inside the scatterplot. The curves can be seen most clearly in the lower left of the graph. I’m not sure if this is the influence of another variable or if it’s an artifact introduced by graphing a proportion against one of its constituent parts.
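One way to probe the artifact hypothesis: if free sulfur dioxide is recorded at discrete values (an assumption here), every wine sharing a free value k falls on the curve proportion = k / total, which would draw a family of hyperbolas regardless of any other variable. The sketch below uses invented data and the hypothetical column names `free`, `total` and `prop` to show the effect.

```r
# Invented data: free SO2 restricted to whole numbers, total SO2 continuous.
set.seed(7)
n <- 500
toy <- data.frame(free = sample(5:60, n, replace = TRUE))
toy$total <- toy$free + runif(n, min = 20, max = 200)  # total always exceeds free
toy$prop  <- toy$free / toy$total

# For any fixed free value k, prop * total equals k, i.e. prop = k / total.
by_k <- split(toy, toy$free)
spread <- sapply(by_k, function(d) max(abs(d$prop * d$total - d$free)))
max(spread)  # effectively zero: each group lies on its own hyperbola

# Overlaying a few k / total curves on the points makes the bands visible.
pdf(NULL)  # null device so this runs non-interactively; omit in an interactive session
plot(toy$total, toy$prop, pch = 16, cex = 0.5,
     xlab = "total", ylab = "free / total")
for (k in c(10, 30, 50)) curve(k / x, add = TRUE, col = "red")
dev.off()
```

If the real plot shows the same family of curves, that would favor the artifact explanation, though it would not rule out the influence of another variable.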
Proportion of Free Sulfur Dioxide and Absolute Free Sulfur Dioxide
There’s not likely a linear relationship between the proportion of free sulfur dioxide and total sulfur dioxide, but is there one between the proportion of free sulfur dioxide and free sulfur dioxide?
##
## Pearson's product-moment correlation
##
## data: ww$free.sulfur.dioxide and ww$SO2.portion.free
## t = 76.6688, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7256365 0.7511009
## sample estimates:
## cor
## 0.7386321


These graphs and the linear correlation table show the rise in the proportion of unbound sulfur dioxide as absolute levels of free sulfur dioxide rise. (I tentatively assume the total sulfur dioxide is also rising in tandem.) Calculating the coefficient of determination as r^2 = .546, it looks like half of the change in the proportion of free sulfur dioxide can be explained by the rise in absolute free sulfur dioxide levels.
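The arithmetic behind that figure can be checked directly from the test output above. As a consistency check, the reported t statistic can also be recovered from r and the degrees of freedom.

```r
# Values copied from the Pearson test output above.
r  <- 0.7386321
df <- 4896

# Coefficient of determination: share of variance in one variable
# accounted for by a linear fit on the other.
r_squared <- r^2
round(r_squared, 3)

# The t statistic of a Pearson test is r * sqrt(df) / sqrt(1 - r^2).
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 2)
```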
Residual Sugar and Free Sulfur Dioxide
Now that we have looked at sulfur dioxide levels proper, let’s move on to the variables that might be correlated with them.
Residual Sugar showed (for this dataset) a relatively high correlation coefficient of 0.2990984.
##
## Pearson's product-moment correlation
##
## data: ww$residual.sugar and ww$free.sulfur.dioxide
## t = 21.9324, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2733819 0.3243875
## sample estimates:
## cor
## 0.2990984

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
This graph quite clearly suffers from overplotting, especially in the band between 0 and 2 g/dm3 residual sugar. This can be alleviated somewhat on larger computer screens, but working within the limitations of the knitted HTML file, let’s ignore the very highest values above 20 and replot while setting alpha to 1/5.
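A sketch of such a replot in base R graphics follows. The original plotting code is not shown in this report, so this is only one way it could be done; the data are synthetic and the column names simply mirror those in `ww`.

```r
# Synthetic stand-in data (invented; right-skewed like the residual sugar summary).
set.seed(3)
toy <- data.frame(
  residual.sugar = rexp(1000, rate = 1 / 6) + 0.6,
  free.sulfur.dioxide = rnorm(1000, mean = 35, sd = 15)
)

# Keep only wines at or below 20 g/dm^3 residual sugar.
zoomed <- subset(toy, residual.sugar <= 20)

# Semi-transparent points so that dense regions appear darker.
pt_col <- adjustcolor("black", alpha.f = 1 / 5)

pdf(NULL)  # null device so this runs non-interactively; omit in an interactive session
plot(zoomed$residual.sugar, zoomed$free.sulfur.dioxide,
     pch = 16, col = pt_col,
     xlab = "residual sugar (g/dm^3)",
     ylab = "free sulfur dioxide (mg/dm^3)")
abline(h = 50, col = "red")  # 50 mg/dm^3 threshold mentioned earlier in the text
dev.off()
```

Filtering the rows before plotting, rather than only clipping the axis, also keeps the excluded wines out of any summary drawn from `zoomed`.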

This scatterplot is easy to reconcile with Fig. 10a, and doesn’t appear to offer much new information.
One question in particular this graph doesn’t answer is what is happening in the still present band of overplotting for the residual sugar values less than 2.5 g/dm3.

In this final graph of residual sugar vs free sulfur dioxide I zoomed in on the values below 2.5 g/dm3. This has the disadvantage of ignoring the other values in the dataset, but I think there is some advantage in having a low-level look at the data itself and keeping this image in your mind’s eye as you look at the complete histograms. Nothing particularly unusual is happening in these values. At this level you can see vertical banding, which I would expect originates from the sensor used to measure the residual sugar levels.
Residual Sugar and Total Sulfur Dioxide
Free sulfur dioxide is only one half of the sulfur dioxide equilibrium. I wonder if the relationship between residual sugar and total sulfur dioxide will mirror that of residual sugar and free sulfur dioxide.
##
## Pearson's product-moment correlation
##
## data: ww$residual.sugar and ww$total.sulfur.dioxide
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3776791 0.4246712
## sample estimates:
## cor
## 0.4014393

The correlation table shows a stronger linear correlation between residual sugar and total sulfur dioxide (r value: 0.4014393) than that seen with free sulfur dioxide (r value: 0.2990984).
The general trend up is expected from the correlation coefficient, but the stair step pattern is interesting and is something to investigate further in multivariate analysis.
Residual Sugar and Proportion of Free Sulfur Dioxide
##
## Pearson's product-moment correlation
##
## data: ww$residual.sugar and ww$SO2.portion.free
## t = 3.6034, df = 4896, p-value = 0.0003172
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02345712 0.07932200
## sample estimates:
## cor
## 0.05142979

While there was a small relationship between higher levels of residual sugar and higher levels of free sulfur dioxide (which is the opposite of what I expected to see, as sulfur dioxide related ions will bind with sugar molecules), we can see that there may or may not be a real relationship between residual sugar and the proportion of free sulfur dioxide.
Considering these facts together what I think this means is that the higher sugar wines, having more sugar (and probably less alcohol as well), need higher levels of sulfur dioxide in general to protect against oxidation, microbial growth, etc. I think it is these higher levels of sulfur dioxide that are driving the positive residual sugar to free sulfur dioxide relationship.
Density to Free Sulfur Dioxide
Density showed a high linear correlation with free sulfur dioxide.
##
## Pearson's product-moment correlation
##
## data: ww$density and ww$free.sulfur.dioxide
## t = 21.5397, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2684156 0.3195836
## sample estimates:
## cor
## 0.2942104
It’s unlikely density itself is shifting the dynamic equilibrium between free and bound sulfur dioxide. Any relationship between density and free sulfur dioxide is most likely a spurious one and driven by some other variable. But let’s look at the plot.

Taking note of the outliers, let’s replot and take another look at the data.

There seems to be an interesting shift up in the data centered around .994-.995 g / cm3 density. To describe it, I would say it almost looks like a transform fault between two tectonic plates.
If I had to guess, I would say the relationship in general is being driven by the residual sugar / alcohol complex, with lower residual sugars lowering the density and also the amount of bound sulfur dioxide related ions. At this point, though, I don’t know why the shift appears the way it does instead of as a more gentle and linear slope.
Density to Proportion Free Sulfur Dioxide
##
## Pearson's product-moment correlation
##
## data: ww$density and ww$SO2.portion.free
## t = -4.5947, df = 4896, p-value = 4.442e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09335988 -0.03758727
## sample estimates:
## cor
## -0.06552475

Considering the p-value from the Pearson’s product-moment correlation (p = 4.442e-06), the very slight (r value: -0.0655247) negative correlation might be real. Or it might be noise. Either way it’s not very illustrative.
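With a sample this large, even very weak correlations clear the conventional significance bar, which is why the p-value alone says little here. The sketch below works out the smallest |r| that would be significant at the 5 percent level for df = 4896, and the share of variance the observed r accounts for.

```r
# Values copied from the Pearson test output above.
r  <- -0.06552475
df <- 4896

# Two-sided 5 percent critical t value, converted to a critical correlation:
# r_crit = t_crit / sqrt(t_crit^2 + df).
t_crit <- qt(0.975, df)
r_crit <- t_crit / sqrt(t_crit^2 + df)
round(r_crit, 3)

# Share of variance accounted for by the observed correlation.
round(r^2, 4)
```

Any |r| above roughly 0.03 counts as significant at this sample size, while the observed value accounts for under half a percent of the variance.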
Alcohol to Free Sulfur Dioxide
##
## Pearson's product-moment correlation
##
## data: ww$alcohol and ww$free.sulfur.dioxide
## t = -18.0746, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2761759 -0.2236641
## sample estimates:
## cor
## -0.2501039

There’s a slight trend for the free sulfur dioxide levels to go down as alcohol levels rise. My first thought was this was at least partially caused by the tendency for free sulfur dioxide levels to decrease when surrounded by higher levels of sugar molecules. And in turn the amount of sugar that remains in solution is inversely related to the amount that is converted to alcohol by the fermentation process.
Alcohol to Proportion of Free Sulfur Dioxide
If absolute levels of free sulfur dioxide tend to decrease as alcohol increases, and higher levels of free sulfur dioxide are associated with a higher proportion of free sulfur dioxide, does alcohol show a correlation with the proportion of free sulfur dioxide?

The scatterplot doesn’t seem to show any relationship, but let’s also run a Pearson’s product-moment correlation test.
##
## Pearson's product-moment correlation
##
## data: ww$alcohol and ww$SO2.portion.free
## t = 4.5202, df = 4896, p-value = 6.324e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03652591 0.09230621
## sample estimates:
## cor
## 0.06446642
The result is statistically significant (p-value: 6.3235089e-06), but small (r value: 0.0644664).
Alcohol and Residual Sugar
Now that we have looked at the variables that the correlation table showed to have linear relationships with free sulfur dioxide, let us look at those variables that I expected would show a relationship with free sulfur dioxide based on the dataset text file. Also, let’s explore how Alcohol, Residual Sugar and Density are related.

##
## Pearson's product-moment correlation
##
## data: ww$alcohol and ww$residual.sugar
## t = -35.3209, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312

These graphs show the general inverse relationship between alcohol and residual sugar. What was interesting to me was the initial rise in alcohol content in the very lowest sugar wines.
Knowing now that there would be a thick band of overplotting at residual sugars around 5, I decided to flip the graph to get a better look at the scatterplot between alcohol and residual sugars.
Density and Residual Sugar

Sulphates and Free Sulfur Dioxide
##
## Pearson's product-moment correlation
##
## data: ww$sulphates and ww$free.sulfur.dioxide
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03126264 0.08707928
## sample estimates:
## cor
## 0.05921725

Taking a closer look at sulphates and free sulfur dioxide than the scatterplot matrix could provide, it’s still not apparent to me that there is any sort of real relationship between these two variables. We’ll see if anything shows up in multivariate analysis.
Sulphates and Proportion of Free Sulfur Dioxide
##
## Pearson's product-moment correlation
##
## data: ww$sulphates and ww$SO2.portion.free
## t = -1.5651, df = 4896, p-value = 0.1176
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05033679 0.00564813
## sample estimates:
## cor
## -0.02236186

Rather unsurprisingly given what we have seen concerning sulphates so far, there does not appear to be any relation between sulphates and the proportion of free sulfur dioxide.
Quality and Free Sulfur Dioxide
Finally let’s see if there is any relation between quality and our main features of interest 1. Free Sulfur Dioxide, 2. Total Sulfur Dioxide and 3. Proportion of Free Sulfur Dioxide.
Starting with Free Sulfur Dioxide.

##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The medians and means for qualities 5, 6, 7, and 8 appear to be essentially identical. Quality rating 3 has a few outliers that shift the mean toward a higher value, but its median puts it in line with 5, 6, 7, and 8. There might be some relation between quality rating 4 and relatively lower free sulfur dioxide, but I think this is almost assuredly created by noise. It’s hard to see a real reason for free sulfur dioxide to be lower in wines that are specifically the second worst.
Interestingly the wines with the highest and second highest free sulfur dioxide count were both rated the lowest (rating 3). There were 17 wines with free sulfur dioxide levels over 100 mg/dm3. Seven of them were rated 6, 7, or 8. Ten of them were rated 3, 4, or 5. There were 3253 wines rated 6, 7, or 8 and there were 1640 wines rated 3, 4, or 5.
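Putting those counts on a common scale makes the comparison easier to read. The sketch below only re-expresses the numbers already given in the text and the quality table; with just 17 such wines, any difference should be read cautiously.

```r
# Counts copied from the quality table and the text above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

n_low  <- sum(quality_counts[c("3", "4", "5")])  # wines rated 3, 4, or 5
n_high <- sum(quality_counts[c("6", "7", "8")])  # wines rated 6, 7, or 8

# Share of each group with free SO2 over 100 mg/dm^3.
rate_low  <- 10 / n_low
rate_high <- 7 / n_high

round(100 * c(low = rate_low, high = rate_high), 2)  # percentages
round(rate_low / rate_high, 1)                        # ratio of the two rates
```

That works out to roughly 0.6 percent of the lower-rated wines against roughly 0.2 percent of the higher-rated ones.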
What most strikes me from these boxplots is that in no quality bracket does the 3rd quartile extend past the threshold of 50 mg / dm3 free sulfur dioxide.
Quality to Total Sulfur Dioxide

The most striking result for these boxplots is not their central tendencies, but that the higher the quality of the wine, the smaller the total variability in total SO2 levels. Similar to the boxplots of quality and free sulfur dioxide, the two wines with the highest levels of total sulfur dioxide were rated of the lowest quality (rating 3). This is not that surprising given that total sulfur dioxide levels probably have the strongest effect on the levels of free sulfur dioxide.
Similar to the quality and free sulfur dioxide boxplots, rating 3 has a higher median than rating 4, and rating 4 has a lower median than 5 and 6.
Quality to Proportion of Free Sulfur Dioxide
Finally is there any association between quality rankings and the proportion of free sulfur dioxide in the wines in general?

While the trend is not consistent at the lowest and highest quality ranking levels, there does seem to be a slight trend for the proportion of free sulfur dioxide to increase with quality. For purposes of deciding whether the trend is real and generalizable to the wines in general, it is less meaningful that quality rankings 3 (lowest) and 9 (highest) are not consistent with the trend, because of the very low number of data points at those quality rankings. However, it might be that for the wines ranked 4 through 8 there is a tendency for the proportion of free sulfur dioxide to rise with the quality ranking.
Mean of The Proportion of Free Sulfur Dioxide Grouped by Quality Ranking

This plot shows the means of the subsets of proportion of free sulfur dioxide grouped by quality ranking. The error bars treat the quality subsets as samples from a larger population of wine of that quality ranking and show the 95 % Confidence Interval for the mean.
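How the error bars were computed is not shown in this report. A common choice, assumed here, is a t-based interval: mean ± t(0.975, n - 1) * sd / sqrt(n). The helper `mean_ci` and the toy data below are hypothetical and only illustrate that calculation.

```r
# Hypothetical helper: mean with a t-based confidence interval.
mean_ci <- function(x, level = 0.95) {
  n    <- length(x)
  m    <- mean(x)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = m, lower = m - half, upper = m + half, n = n)
}

# Invented data standing in for ww$SO2.portion.free and ww$quality.
set.seed(11)
toy <- data.frame(
  quality = rep(c(4, 5, 6), times = c(15, 60, 80)),
  SO2.portion.free = c(rnorm(15, 0.20, 0.08),
                       rnorm(60, 0.24, 0.08),
                       rnorm(80, 0.26, 0.08))
)

# One row per quality ranking: mean, interval bounds, and group size.
ci_by_quality <- t(sapply(split(toy$SO2.portion.free, toy$quality), mean_ci))
print(round(ci_by_quality, 3))
```

Under this approach, rankings with very few wines (3 and 9 in the quality table) would get much wider intervals than the well-populated middle rankings.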